1 Bib Clustering Analysis

1.1 Background

Bib numbers are unique identifiers assigned to each runner before, during, and after the race; during the race, the runner actually wears the bib. In some races, such as the Boston Marathon, bib numbers are handed out in batches and used to organize the waves in which runners start. To make the start of a 26,000-person race more orderly, the 2017 Boston Marathon broke the runners into four color-coded groups. To determine which group (or wave) a runner would be in, the marathon organizers used previously submitted qualifying times, as detailed below (“Registration”, 2019).

  • Red bibs (numbers 101 to 7,700) are assigned to Wave 1 (starting at 10:00 a.m.).

  • White bibs (numbers 8,000 to 15,600) are assigned to Wave 2 (starting at 10:25 a.m.).

  • Blue bibs (numbers 16,000 to 23,600) are assigned to Wave 3 (starting at 10:50 a.m.).

  • Yellow bibs (numbers 24,000 to 32,500) are assigned to Wave 4 (starting at 11:15 a.m.).

The cutoff between Wave 1 and Wave 2 is a 3:10:43 marathon qualifying time; between Wave 2 and Wave 3, 3:29:27; and between Wave 3 and Wave 4, 3:57:18.

The question at hand is whether we can develop an unsupervised clustering model that accurately identifies these groupings without using the information from the paragraph above. An additional question is whether we can confirm that the fourth group also includes runners who did not have to qualify for the marathon but were instead running for a charity.

1.2 Data Cleaning and Exploration

We first need to convert the bib number from a factor to an integer. We have to convert the factor to a character first, though, because directly converting a factor to an integer returns the underlying factor level code, not the number the label represents.
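In R, the conversion looks like the following (a minimal sketch with made-up bib values):

```r
# A factor's integer codes are its level indices, not the numbers the
# labels represent, so convert via character first.
bib_factor <- factor(c("101", "8000", "16000"))

wrong <- as.integer(bib_factor)                # underlying level codes
right <- as.integer(as.character(bib_factor))  # the actual bib numbers: 101 8000 16000
```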

Next, we can plot the finishing time against bib number and start to see several trends. This plot is rather dense, so we can use a density scatterplot to better see the distribution of the data.

First, finishing times slowly yet steadily increase, supporting the theory that faster finishers get lower bib numbers. Second, there are about four observable clusters, which match the waves organized by the Boston Marathon at the start. Finally, the last group has much more variance within it, and far slower average finishing times. These are likely the bib numbers of charity runners and other runners who did not need to qualify for the race.

Now we can label the data with the right group names so we can compare our model’s output.

1.3 K-Means Clustering

We can try K-means clustering to see if the algorithm can successfully identify the known clusters. Since there are four known clusters, we will provide ‘4’ as a parameter for the K-means algorithm. Additionally, since K-means starts from a random division of elements, we will set the random seed to one and run K-means with 20 random starts, keeping the model with the lowest total within-cluster sum of squares.
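The fit can be sketched as follows (on simulated bib numbers standing in for the real data; `nstart = 20` reruns the random initialization and keeps the best result):

```r
# Simulate bib numbers drawn from the four wave ranges.
set.seed(1)
bibs <- c(runif(100, 101, 7700),   runif(100, 8000, 15600),
          runif(100, 16000, 23600), runif(100, 24000, 32500))

# centers = 4 because we know there are four waves; nstart = 20 keeps
# the run with the lowest total within-cluster sum of squares.
km <- kmeans(bibs, centers = 4, nstart = 20)
table(km$cluster)  # sizes of the four recovered clusters
```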

As the graph above shows, K-means identifies all four clusters and their break points perfectly. The model works well here partly because of the clear gaps between the four clusters.

We can also build a confusion matrix to verify that the K-means clusters map to the correct wave groups. Note that the cluster numbers K-means assigns are arbitrary, so we have to manually check which predicted cluster corresponds to which wave group in our original data.

##           Reference
## Prediction    1    2    3    4
##          1    0    0    0 6485
##          2    0    0 6713    0
##          3 6552    0    0    0
##          4    0 6614    0    0

1.4 Conclusions

From our analysis so far, K-means clustering appears to be the best model for this problem set. Because it is a top-down approach that fits the data into the number of clusters supplied as an input, it does an effective job of finding the break points in this data.

2 Distance Traveled Analysis

2.1 Background

In our previous data exploration, we conducted extensive analysis of how a runner’s hometown affected finishing time. We found strong differences amongst home states and home countries, with faster runners coming from African countries like Ethiopia and Kenya and slower runners coming from New England states near the start of the marathon.

We were curious to see if the distance traveled had a more direct relationship with finishing time. To answer this question, we had to do some feature engineering. In data science, we rarely have all the data needed and frequently need to build our own features based on some aspects of the existing data.

To attack this problem, we leveraged work from a previous class, using Python to pass place names to the Google Maps geocoding API, which returns location metadata including latitude and longitude. To determine the distance between a runner’s provided hometown and the race start, we built a custom function implementing the haversine formula. All details are included in the attached Jupyter Notebook and HTML files. Once the calculations were complete, the distance traveled for each runner was merged back into our original dataset using bib number as the unique identifier.
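The original calculation was done in Python; an equivalent R sketch of the haversine formula looks like this (the coordinates below are illustrative, not runner data):

```r
# Great-circle distance between two (lat, lon) points given in degrees,
# returned in kilometers (mean earth radius ~6371 km).
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# e.g. New York City to Boston, roughly 300 km
d <- haversine_km(40.7128, -74.0060, 42.3601, -71.0589)
```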

2.2 Linear Regression Fitting and Analysis

First we will build a linear model, and then use that model to predict the finishing time for each runner as a function of the distance traveled from their hometown. Once we build the model, we can plot the actual data with the predicted data as a line. As you can see from the plot below, there appears to be a very weak negative correlation between distance traveled and finishing time. Let’s examine the model output for more details.
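A sketch of this fit on simulated data (`distance` in miles, `finish` in minutes; the real model uses the engineered distance feature and actual finishing times):

```r
# Simulate a weak negative relationship between distance and finish time.
set.seed(1)
distance <- runif(500, 0, 8000)
finish   <- 240 - 0.002 * distance + rnorm(500, sd = 30)

fit <- lm(finish ~ distance)
summary(fit)$r.squared                                  # small: weak relationship
predict(fit, newdata = data.frame(distance = c(100, 5000)))
```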

We can further analyze the model by looking at specific outputs. The R-squared value of this model is 0.00713. The p-value for the distance term is effectively zero. The coefficient for the distance variable is -0.00203. The VIF is 1.

These outputs confirm what we saw in the plot. For every additional mile between a runner’s hometown and the race, the runner is expected to finish 0.00203 minutes faster. However, the model is a very poor fit for the data, with large residuals.

2.3 Conclusions

This is an excellent example that detailed feature engineering and even a very small p-value do not guarantee a good model. The distance-traveled coefficient is statistically significant, but the adjusted R-squared is essentially zero, so the model has little predictive power, as the large residuals make obvious.

3 Rank, Gender and Age Analysis

3.1 Linear Regression

3.1.1 Fourth Quarter Performance

Because we know from our previous analysis that the fourth quarter rank contributes the most to the final rank, we modeled how a runner’s previous performance, represented by bib number (as discussed in an earlier section), age, and gender affects fourth quarter rank.

In a model predicting fourth quarter rank from bib number, gender, and age, the intercept is 1.66e+03 and is significant (p<.001). All predictors, bib number, gender, and age, significantly predict fourth quarter rank (p<0.001). A one-unit increase in bib number increases fourth quarter rank by 0.557 (p<0.001), meaning a higher bib number (slower past performance) predicts a higher fourth quarter rank (slower fourth quarter performance). Being male (vs. female) increases fourth quarter rank by 1,420 (p<0.001). A one-year increase in age increases fourth quarter rank by 45.6 (p<0.001), meaning older runners will likely have a higher fourth quarter rank (slower fourth quarter performance). R2 (adjusted) is .434, which means that 43.4% of the variability in fourth quarter rank is explained by bib number, gender, and age. The VIFs are 1.17-1.30, indicating low multicollinearity.
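A sketch of this kind of fit on simulated data, with a hand-rolled VIF check (the report likely used `car::vif`; the calculation below is the standard 1/(1 − R²) definition applied directly):

```r
# Simulated stand-ins for the real predictors.
set.seed(1)
n <- 1000
bib    <- sample(101:32500, n, replace = TRUE)
male   <- rbinom(n, 1, 0.5)
age    <- sample(18:80, n, replace = TRUE)
rank4q <- 1000 + 0.5 * bib + 1400 * male + 45 * age + rnorm(n, sd = 2000)

fit <- lm(rank4q ~ bib + male + age)

# VIF for one predictor: regress it on the others, then 1 / (1 - R^2).
vif_one <- function(j, X) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}
X <- cbind(bib, male, age)
v <- sapply(seq_len(ncol(X)), vif_one, X = X)  # near 1: predictors ~independent
```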

3.1.2 Overall Performance

We also looked at rank in each quarter of the race, age, and gender to see their effect on final official time and on overall rank.

Rank in each quarter of the race, gender, and age all have a significant effect on official finishing time. However, the VIFs for rank in 0-10k, 10-20k, and 20-30k show that these independent variables are highly correlated. Given our previous finding that fourth quarter performance contributes the most to final rank, we rebuilt the model after dropping the first three quarters’ ranks.

In a model predicting official time from fourth quarter rank, gender, and age, the intercept is 173 and is significant (p<.001). All predictors, fourth quarter rank, gender, and age, significantly predict official time (p<0.001). R2 (adjusted) is .832, which means that 83.2% of the variability in official time is explained by fourth quarter rank, gender, and age. The VIFs are 1.09-1.13, indicating low multicollinearity.

A one-unit increase in fourth quarter rank increases official time by 0.0048. Being male (vs. female) decreases official time by 10.70. A one-year increase in age increases official time by 0.1792.

In the improved model predicting overall rank, all p-values are below 0.001, meaning every variable is now significant, and the VIFs are between 1.09 and 1.13, indicating low multicollinearity. R2 (adjusted) is .9016, which means that 90.16% of the variability in overall rank is explained by fourth quarter rank, gender, and age. So the model has been improved appropriately, and fourth quarter rank, gender, and age all contribute to overall rank.

A one-unit increase in fourth quarter rank increases overall rank by 0.8916. Being male (vs. female) decreases overall rank by 2,521. A one-year increase in age increases overall rank by 58.24.

3.2 Official Time Distribution among Different Groups

The following two plots show that the distributions of overall rank and official time are related to fourth quarter rank, gender, and age.

3.3 KNN Analysis

Next, we explored whether we could predict participants’ rank in four categories from age, gender, and fourth quarter rank. We classified official time into four groups by quartile.
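The quartile binning can be sketched as follows (on illustrative times in minutes, not the actual dataset):

```r
# Bin a numeric official time into four ordered groups by quartile.
set.seed(1)
official <- rnorm(1000, mean = 240, sd = 40)
level <- cut(official,
             breaks = quantile(official, probs = seq(0, 1, 0.25)),
             labels = 1:4, include.lowest = TRUE)
table(level)  # four equal-sized groups
```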

Then we ran the KNN test.

It seems 12-nearest neighbors (with an accuracy rate of 72.5%) is a decent choice, because the gains in predictive accuracy trail off beyond that point. So we can predict a participant’s level from age, fourth quarter rank, and gender. The cross table of actual versus predicted values is the following:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  8693 
## 
##  
##                  | class_pred 
## class.testLabels |         1 |         2 |         3 |         4 | Row Total | 
## -----------------|-----------|-----------|-----------|-----------|-----------|
##                1 |      1752 |       422 |        37 |         0 |      2211 | 
##                  |     0.792 |     0.191 |     0.017 |     0.000 |     0.254 | 
##                  |     0.879 |     0.192 |     0.016 |     0.000 |           | 
##                  |     0.202 |     0.049 |     0.004 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|-----------|-----------|
##                2 |       237 |      1333 |       563 |        36 |      2169 | 
##                  |     0.109 |     0.615 |     0.260 |     0.017 |     0.250 | 
##                  |     0.119 |     0.605 |     0.243 |     0.017 |           | 
##                  |     0.027 |     0.153 |     0.065 |     0.004 |           | 
## -----------------|-----------|-----------|-----------|-----------|-----------|
##                3 |         4 |       438 |      1362 |       371 |      2175 | 
##                  |     0.002 |     0.201 |     0.626 |     0.171 |     0.250 | 
##                  |     0.002 |     0.199 |     0.588 |     0.170 |           | 
##                  |     0.000 |     0.050 |     0.157 |     0.043 |           | 
## -----------------|-----------|-----------|-----------|-----------|-----------|
##                4 |         0 |         9 |       355 |      1774 |      2138 | 
##                  |     0.000 |     0.004 |     0.166 |     0.830 |     0.246 | 
##                  |     0.000 |     0.004 |     0.153 |     0.813 |           | 
##                  |     0.000 |     0.001 |     0.041 |     0.204 |           | 
## -----------------|-----------|-----------|-----------|-----------|-----------|
##     Column Total |      1993 |      2202 |      2317 |      2181 |      8693 | 
##                  |     0.229 |     0.253 |     0.267 |     0.251 |           | 
## -----------------|-----------|-----------|-----------|-----------|-----------|
## 
## 

4 Split Times Analysis

4.1 Linear Regression 1

Let us predict the Official time of the marathon by using all the predictor variables in the data.

In the model predicting the official time of the marathon from age, gender, and all the split times, not all of the coefficients are significant: only the intercept, age, and some of the split times are. R2 (adjusted) is 0.99765, which is nearly 1. This is because we used all the split times up to 40K, so the model explains over 99% of the variability in official time using the split times, gender, and age. The main problem, however, is multicollinearity: the VIF values of the split times exceed 10, which indicates substantial multicollinearity in the data. We will address this with PCA and PCR. Before that, let’s predict the official time of the marathon using only age, gender, and the half-marathon split.

4.2 Linear Regression 2

##                Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -11.6950930 0.584334181 -20.01439  1.886906e-88
## Age          -0.0160872 0.007772944  -2.06964  3.849580e-02
## M.F1          5.9847386 0.183431769  32.62651 6.437021e-229
## Half          2.2366367 0.005014392 446.04342  0.000000e+00
##      Age     M.F1     Half 
## 1.158590 1.225813 1.232414

In this model, we predict the official time of the marathon from age, gender, and the half-marathon split. All the coefficients are statistically significant. R2 (adjusted) is 0.89934, meaning the model explains about 90% of the variability in official time. The VIF values also indicate no multicollinearity.

4.3 PCA and PCR

Let us see whether we can apply Principal Component Analysis and Regression to our split times.

Principal component analysis of the split times shows that the first principal component (PC1) explains 97% of the variance in the data, and PC1 and PC2 together explain 99%. We can observe this in the plot below.

Let us now perform regression on the principal components, i.e., principal component regression (PCR). We can validate R2 as a function of the number of components used.

In the principal component regression, one component covers 97% of the variance in the split times and 92% of the variance in the official time, whereas two components cover 99% of the variance in the split times and 98% of the variance in the official time. The R2 value is close to .95 with one component and close to .98 with two, i.e., two principal components cover 98% of the variance in the official time.
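The PCA-then-regression idea can be sketched in base R (simulated, highly correlated split times standing in for the real ones; the report's PCR was likely fit with the `pls` package):

```r
# Simulate 8 split times driven by one underlying "speed" variable.
set.seed(1)
n <- 500
base <- rnorm(n, 100, 15)
splits <- sapply(1:8, function(k) k * base + rnorm(n, sd = 3))
official <- 5 * base + rnorm(n, sd = 5)

pc <- prcomp(splits, scale. = TRUE)
cum_prop <- summary(pc)$importance["Cumulative Proportion", 1:2]  # PC1 dominates

# Regress the response on the leading principal component scores.
fit <- lm(official ~ pc$x[, 1:2])
r2 <- summary(fit)$r.squared
```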

5 Predicting Official Time Using Different Models

5.1 Background

Running the right pace in a marathon is critical for finishing with the best official time. The popular pacing strategies that are most often successful are 1) running even splits throughout the race and 2) slowing by a few seconds per mile as the race progresses.

Runners calculate the pace they need to maintain in each split during training in order to achieve their best official time in the actual marathon. Pace is the time taken per kilometer: for example, a pace of 5’ 20" means the runner takes 5 minutes and 20 seconds to run one kilometer. Lower pace values mean a faster runner.

In the analysis below, we will use age, gender, and the runners’ paces up to the half distance of the marathon (21.1 km) as features, predict the official time with several models (linear regression, lasso and ridge, bagging, random forest, gradient boosting, and xgboost), and see which model performs best.

Additionally, we will look at feature importance that tells us which features are important in predicting the official time.

5.2 Preprocessing and Exploration

There are 26,259 rows in total after dropping unused columns and removing NA and blank values from the data.

Let’s calculate each runner’s pace up to the half time, defined as the time taken to cover half the distance (21.1 km) of the marathon.

In the dataset, we have the time taken by each runner to cover 5 km, 10 km, 15 km, 20 km, 21.1 km (half distance), 25 km, 30 km, 35 km, 40 km, and 42.2 km (official time). Using these, we will calculate the time splits and the pace of each runner within each split.
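The split-and-pace calculation can be sketched as follows (illustrative cumulative times for one runner, not real data):

```r
# Cumulative times (minutes) at each checkpoint up to the half.
checkpoint_km  <- c(5, 10, 15, 20, 21.1)
cumulative_min <- c(25, 51, 78, 106, 112)

segment_min <- diff(c(0, cumulative_min))  # time spent within each split
segment_km  <- diff(c(0, checkpoint_km))   # distance of each split
pace <- segment_min / segment_km           # minutes per km in each split
round(pace, 2)
```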

5.2.1 Extracting features and Targets

Let’s extract age, gender, pace till Half time and official time columns from the dataset.

5.2.2 Train and Test splits

Let’s split the data into a 70% train set and a 30% test set, setting the seed so the split is reproducible.
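A minimal sketch of the split (stand-in data frame; the seed value here is arbitrary):

```r
set.seed(123)  # fixed seed makes the split reproducible
df <- data.frame(x = rnorm(1000), y = rnorm(1000))  # stand-in data

train_idx <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
c(nrow(train), nrow(test))  # 700 and 300 rows
```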

5.2.3 Correlation Matrix

Next let’s plot correlation matrix on training data to observe the relationship between the features and also with the target.

From the plot above, we can see that multicollinearity exists among the pace variables. We can also see that the pace variables are strongly correlated with official time.

5.2.4 Standardizing the Train and Test Data

In contrast to ordinary least squares, lasso and ridge regression are highly affected by the scale of the predictors. It is therefore better to standardize (i.e., scale) the predictors before applying lasso and ridge, so that all the predictors are on the same scale.
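One detail worth making explicit: the test set should be standardized with the training set's statistics, not its own. A sketch on stand-in matrices:

```r
# Scale the training predictors, then apply the *training* means and
# standard deviations to the test set to avoid leaking test information.
set.seed(1)
train_x <- matrix(rnorm(200, mean = 10, sd = 3), ncol = 2)
test_x  <- matrix(rnorm(60,  mean = 10, sd = 3), ncol = 2)

train_scaled <- scale(train_x)
test_scaled  <- scale(test_x,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
colMeans(train_scaled)  # ~0 by construction
```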

5.3 Models

5.3.1 Ordinary Least Squares (OLS)

First, let’s build an Ordinary Least Squares model using train data and predict official time for test data.

From the plots above, we can see that the pace between 15k and 20k and the pace between 20k and 21.1k (half time) are the top two features in determining the official time. About 91.79% of the variability in official time is explained by the predictor variables, but the VIF values show that multicollinearity exists among the independent variables. So let’s build a model that helps address the multicollinearity problem.

5.3.2 Ridge Regression

Ridge regression shrinks the coefficients of the independent variables to mitigate multicollinearity. We need to choose the regularization parameter lambda, which controls the amount of coefficient shrinkage. The best lambda for the data can be defined as the one that minimizes the cross-validation prediction error, which we can find with the function cv.glmnet().
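The report fits ridge with `cv.glmnet()`; as a minimal base-R illustration of what the penalty does, the closed-form ridge solution on standardized simulated predictors is:

```r
# Two highly correlated predictors, simulated.
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1)
X  <- scale(cbind(x1, x2))
y  <- 2 * x1 + rnorm(n)

# Closed-form ridge: beta = (X'X + lambda * I)^{-1} X'y
ridge_beta <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
b_ols   <- ridge_beta(X, y, 0)   # lambda = 0 recovers OLS
b_ridge <- ridge_beta(X, y, 50)  # larger lambda shrinks the coefficients
```

Increasing lambda always shrinks the overall size of the coefficient vector, which is exactly the behavior cv.glmnet() is tuning.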

The first vertical dotted line marks the lowest MSE; the second marks the lambda within one standard error of it. The labels across the top of the graph show how many non-zero coefficients are in the model. The best lambda value found here is 0.0934576. Let’s predict the official time for the test data at the best lambda and calculate the r-squared value, then fit the ridge model on the train set and extract the coefficients at the best lambda.

##          Age         M.FM    Pace0k.5k   Pace5k.10k  Pace10k.15k 
##   0.03490199   3.84633811   6.00007979   5.85997084   9.56233432 
##  Pace15k.20k Pace20k.Half 
##  11.56372032  11.26372657

The fitted coefficients show that the pace between 15 km and 20 km and the pace between 20 km and 21.1 km (half time) are the most important features, and the r-squared value is 91.23%.

5.3.3 The Lasso

Lasso regression also reduces the multicollinearity between variables, by shrinking some coefficients all the way to zero. The same function glmnet( ) with alpha set to 1 builds the lasso regression model.
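The reason lasso can zero out coefficients is its soft-thresholding update; a minimal illustration (this operator is the building block of the coordinate-descent algorithm glmnet uses internally):

```r
# Soft-threshold operator: shrinks z toward zero and sets it exactly
# to zero whenever |z| <= lambda.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft_threshold(c(-3, -0.5, 0.2, 2.5), lambda = 1)
# small values become exactly 0; large ones shrink by lambda
```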

Here, we see that the lowest MSE occurs at \(\lambda \approx 0.0046514\), where the model has 5 non-zero coefficients. Let’s predict the official time for the test data at the best lambda and calculate the r-squared value, then fit the lasso model on the train set and extract the coefficients at the best lambda.

##          Age         M.FM    Pace0k.5k  Pace10k.15k  Pace15k.20k 
##    0.0214768    4.5631016    6.6009531   11.3663131   14.0402185 
## Pace20k.Half 
##   12.5053273

The fitted coefficients again show that the pace between 15 km and 20 km and the pace between 20 km and 21.1 km (half time) are the most important features, and the r-squared value is 91.55%. The main drawback of lasso regression is that, with correlated variables, it sets some of them to zero, which can discard information and lower the model’s accuracy.

5.3.4 Decision tree

Let’s build a decision tree to predict the official time and see whether the model’s accuracy improves. We will perform a grid search to identify the optimal complexity parameter (cp), i.e., the one with the minimum cross-validation error (xerror).

We got an optimal complexity parameter cp of 0.01 with a cross-validation error (xerror) of 0.111. Now we apply that cp value to get the final model with the optimal tree size, and plot the resulting decision tree.

We got a decision tree with an optimal subtree of 7 splits, 8 terminal nodes, and a cross-validated error of 0.112 at a cp value of 0.01.

We got an r-squared value of 88.63%. The main disadvantage of a decision tree is that, with sufficient depth, it produces low-bias but high-variance predictions: the model changes a lot with changes in the training data, and its accuracy changes with it. To control the variance, we use a technique called bagging.

5.3.5 Bagging Tree

In bagging, we build an ensemble whose base models have low bias and high variance; averaging them yields a model with low bias and low variance. Below we build an ensemble of decision trees using the treebag method and plot the variable importance.
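The treebag method wraps the idea below; a hand-rolled sketch on simulated data (using `rpart`, which ships with R) makes the mechanism explicit:

```r
library(rpart)

# Bagging by hand: fit trees on bootstrap resamples, average predictions.
set.seed(1)
n <- 300
df <- data.frame(x = runif(n, 0, 10))
df$y <- sin(df$x) + rnorm(n, sd = 0.3)

n_trees <- 25
preds <- sapply(seq_len(n_trees), function(b) {
  boot <- df[sample(n, replace = TRUE), ]  # bootstrap rows with replacement
  tree <- rpart(y ~ x, data = boot)
  predict(tree, newdata = df)
})
bagged <- rowMeans(preds)  # ensemble average smooths out single-tree variance
```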

We got an r-squared value of 90.31% using the bagging tree. The main disadvantage of bagging is that the predictions of the individual decision trees are highly correlated. To reduce that correlation, we turn to a random forest model.

5.3.6 Random Forest

Random forest is an extension of bagging in which, in addition to bootstrapping rows with replacement, we also sample a random subset of the features at each split.

We got an r-squared value of 90.31%.

5.3.7 Gradient Boosting

Boosting is another approach for improving on a single decision tree. In gradient boosting, we build an ensemble of decision trees sequentially and sum their predictions. Each tree tries to recover the remaining loss (the difference between actual and predicted values) by fitting to the residuals of the trees before it, which results in a more accurate model.
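The boosting loop can be sketched in base R (a toy illustration on simulated data, not the gbm fit used in this report):

```r
library(rpart)

# Hand-rolled gradient boosting for squared error: each shallow tree
# fits the residuals of the current ensemble, scaled by a learning rate.
set.seed(1)
n <- 300
df <- data.frame(x = runif(n, 0, 10))
df$y <- sin(df$x) + rnorm(n, sd = 0.3)

pred <- rep(mean(df$y), n)  # start from the mean prediction
eta  <- 0.1                 # learning rate (shrinkage)
for (m in 1:50) {
  resid <- df$y - pred      # residuals of the current ensemble
  stump <- rpart(r ~ x, data = data.frame(x = df$x, r = resid),
                 control = rpart.control(maxdepth = 2))
  pred <- pred + eta * predict(stump, newdata = df)
}
mse <- mean((df$y - pred)^2)  # training MSE shrinks as trees are added
```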

##                       var      rel.inf
## Pace15k.20k   Pace15k.20k 55.220474909
## Pace20k.Half Pace20k.Half 41.742211879
## Pace10k.15k   Pace10k.15k  1.627932048
## Pace0k.5k       Pace0k.5k  0.816999287
## M.F                   M.F  0.342694202
## Pace5k.10k     Pace5k.10k  0.246855787
## Age                   Age  0.002831888

We got an r-squared value of 91.54%. The pace between 15k and 20k and the pace between 20k and 21.1k are the top two variables in our gbm model.

5.3.8 xgboost

Extreme gradient boosting, or xgboost, typically gives better predictions than standard gradient boosting and is computationally fast to train.

We got an r-squared value of 92.48% using xgboost.

5.4 Conclusion

  • Out of all the models, the best r-squared value, 92.48%, came from the xgboost model, which is also computationally much faster to train than the other models.

  • The feature importance values from all the models show that the pace between 15k and 20k and the pace between 20k and 21.1k are the two most important features in predicting the official time. So a runner with a better pace between 15k and 21.1k has a better chance of a faster finish.

6 Bibliography

Registration. (2019). Retrieved October 18, 2019, from http://registration.baa.org/2017/cf/Public/iframe_EntryLists.cfm.